Large-Scale Collective Entity Matching

Authors

Vibhor Rastogi

Nilesh N. Dalvi

Minos N. Garofalakis

Abstract

The difficult problem of Entity Matching (EM), i.e., determining whether different mentions of an entity refer to the same real-world object, has recently attracted a lot of attention from the Machine Learning (ML) research community. State-of-the-art solutions for EM are based on recent advances in ML, such as first-order probabilistic models (e.g., Markov and Bayesian networks) and advanced probabilistic inference techniques. A key benefit of these ML tools comes from their purely collective nature: match evidence for related entities can be collectively reinforced into high-probability EM decisions. In addition, they provide a principled framework for imposing a probability distribution over possible EM results, which could be useful in many settings (e.g., for user feedback). While such state-of-the-art ML approaches to EM have been shown to be very accurate in practice, they also typically require complex inference over very large model graphs; thus, their scalability to real-life datasets has remained a big challenge. Toward this end, in this work, we propose a principled framework to scale general, collective EM operators. Our framework is generic: it uses a black-box abstraction to incorporate any entity matcher. The main idea is to approximate the run of the entity matcher on the entire data set by: (1) running multiple instances of the matcher on several small subsets of the entities, and (2) message passing, i.e., passing a judiciously built message set across the instances to control the interaction between different runs of the matcher. While the notion of communicating “blocks” of EM is not entirely new, our work is the first to carry out a complete, formal analysis of the above framework, and show that for a broad class of “well-behaved” entity matchers, the approach is provably sound. We also propose novel message-passing schemes for probabilistic EM tools that, as our results demonstrate, significantly improve EM recall without compromising soundness. Finally, we present experimental results demonstrating the effectiveness of our approach and its ability to scale to large real-life datasets.
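The two-step idea in the abstract (run a black-box matcher on small subsets, then pass messages between the runs) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the names `collective_match` and `toy_matcher`, the toy author data, the choice of "matched pairs" as the messages, and the simple round-based schedule. The sketch also assumes a matcher whose output only grows when it is given more evidence, which is what makes the loop terminate.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Set

Pair = FrozenSet[str]
# Black-box matcher: (entities in one subset, matches known so far) -> matched pairs.
Matcher = Callable[[Set[str], FrozenSet[Pair]], Set[Pair]]

# Hypothetical toy data: author mentions and the coauthor each one appears with.
NAMES = {"a1": "J. Smith", "a2": "John Smith", "b1": "M. Lee", "b2": "M. Lee"}
COAUTHOR = {"a1": "b1", "a2": "b2", "b1": "a1", "b2": "a2"}


def toy_matcher(block: Set[str], evidence: FrozenSet[Pair]) -> Set[Pair]:
    """Illustrative collective rule: match identical names, or equal last
    names whose coauthors are already known to match."""
    found: Set[Pair] = set()
    for x, y in combinations(sorted(block), 2):
        same_name = NAMES[x] == NAMES[y]
        same_last = NAMES[x].split()[-1] == NAMES[y].split()[-1]
        coauthors_matched = frozenset({COAUTHOR[x], COAUTHOR[y]}) in evidence
        if same_name or (same_last and coauthors_matched):
            found.add(frozenset({x, y}))
    return found


def collective_match(neighborhoods: List[Set[str]], matcher: Matcher) -> Set[Pair]:
    """Run the matcher on each small subset and share the matches found so far
    (the 'messages') with every later run, until a full pass adds nothing."""
    matches: Set[Pair] = set()
    changed = True
    while changed:
        changed = False
        for block in neighborhoods:
            new = matcher(block, frozenset(matches)) - matches
            if new:
                matches |= new
                changed = True
    return matches


if __name__ == "__main__":
    blocks = [{"a1", "a2"}, {"b1", "b2"}]
    print(collective_match(blocks, toy_matcher))
```

In this toy run, the subset containing the two "Smith" mentions yields no match on its own; it only matches them after the other subset reports that the two "M. Lee" mentions are the same person. A real scheme would choose which subsets to re-run far more selectively than this full-pass loop does.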


Similar resources

Centralized Clustering Method To Increase Accuracy In Ontology Matching Systems

Ontology is the main infrastructure of the Semantic Web which provides facilities for integration, searching and sharing of information on the web. Development of ontologies as the basis of semantic web and their heterogeneities have led to the existence of ontology matching. By emerging large-scale ontologies in real domain, the ontology matching systems faced with some problem like memory con...


Adaptive Approximate Record Matching

Typographical data entry errors and incomplete documents, produce imperfect records in real world databases. These errors generate distinct records which belong to the same entity. The aim of Approximate Record Matching is to find multiple records which belong to an entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...


Convex Collective Matrix Factorization

In many applications, multiple interlinked sources of data are available and they cannot be represented by a single adjacency matrix, to which large scale factorization method could be applied. Collective matrix factorization is a simple yet powerful approach to jointly factorize multiple matrices, each of which represents a relation between two entity types. Existing algorithms to estimate par...


Towards Scalable Real-Time Entity Resolution using a Similarity-Aware Inverted Index Approach

Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim to achieve large-scale real-time entity resolution. Traditional entity resolution techniques have assumed the matching of two static databases. In our networked and online world, howev...


One Size Does Not Fit All: Customizing Ontology Alignment Using User Feedback

A key problem in ontology alignment is that different ontological features (e.g., lexical, structural or semantic) vary widely in their importance for different ontology comparisons. In this paper, we present a set of principled techniques that exploit user feedback to customize the alignment process for a given pair of ontologies. Specifically, we propose an iterative supervised-learning appro...



Journal: PVLDB

Volume: 4

Publication date: 2011
